#install.packages("pacman")
pacman::p_load(tidyverse, here)
#tidyverse required for tidy workflows
#rio required for importing and exporting data
#here required for managing file paths15 Exploring Data with R
To recap what we learnt in the previous sessions.. we now know to work within the R Project environment. here::here() makes it easy for us to manage file paths. You can quickly have a look at your data using the View() and glimpse() functions. Most of the tidy data is read as tibble which is a workhorse of tidyverse.
15.1 Getting Started with the Data Exploration Pipeline
15.1.1 Set-up
Note
The shortcut for code commenting is Ctrl+Shift+C.
15.1.2 Load Data
The dataset we will be working with has been cleaned (to an extent) for the purposes of this workshop. It is a dataset about NHANES that has been took from the
NHANESand cleaned up and modified for our use.
# Check the file path
here::here("data", "nhanes_modified_df.rds")[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book/data/nhanes_modified_df.rds"
# Read Data
df <- read_rds(here("data", "nhanes_modified_df.rds"))Try the following functions using tb as the argument:
glimpse()head()names()
Now, we will be introducing you to two new packages:
dplyrskimrDataExplorer
15.2 dplyr Package
The dplyr is a powerful R-package to manipulate, clean and summarize unstructured data. In short, it makes data exploration and data manipulation easy and fast in R.
There are many verbs in dplyr that are useful, some of them are given here…
15.2.1 Getting used to the pipe |> or %>%
Note
The pipe |> means THEN…
The pipe is an operator in R that allows you to chain together functions in dplyr.
Let’s find the bottom 50 rows of tb without and with the pipe.
Tips The native pipe |> is preferred.
#without the pipe
tail(df, n = 50)
#with the pipe
df |> tail(n = 50)Now let’s see what the code looks like if we need 2 functions. Find the unique age in the bottom 50 rows of df
#without the pipe
unique(tail(df, n = 50)$age)
# with the pipe
df |>
tail(50) |>
distinct(age)Note
The shortcut for the pipe is Ctrl+Shift+M
You will notice that we used different functions to complete our task. The code without the pipe uses functions from base R while the code with the pipe uses a mixture (tail() from base R and distinct() from dplyr). Not all functions work with the pipe, but we will usually opt for those that do when we have a choice.
15.2.2 distinct() and count()
The distinct() function will return the distinct values of a column, while count() provides both the distinct values of a column and then number of times each value shows up. The following example investigates the different race (race) in the df dataset:
df |>
distinct(race)
df |>
count(race)Notice that there is a new column produced by the count function called n.
15.2.3 arrange()
The arrange() function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the desc() function to arrange by descending.
The following code would get the number of times each race is in the dataset:
df |>
count(race) |>
arrange(n)
# Since the default is ascending order,
# we are not getting the results that are probably useful,
# so let's use the desc() function
df |>
count(race) |>
arrange(desc(n))
# shortcut for desc() is -
df |>
count(race) |>
arrange(-n)15.2.4 filter()
If you want to return rows of the data where some criteria are met, use the filter() function. This is how we subset in the tidyverse. (Base R function is subset())
Here are the logical criteria in R:
==: Equal to!=: Not equal to>: Greater than>=: Greater than or equal to<: Less than<=: Less than or equal to
If you want to satisfy all of multiple conditions, you can use the “and” operator, &.
The “or” operator | (the vertical pipe character, shift-backslash) will return a subset that meet any of the conditions.
Let’s see all the data for age 60 or above
df |>
filter(age >= 60)Let’s just see data for white
df |>
filter(race == "White")Both White and age 60 or more
df |>
filter(age >= 60 & race == "White")Lets create a column called ind_sub if the country is in the Indian Subcontinent.
# use our vector indian_subcont that we created before
tb |>
mutate(indian_sub1 = if_else(country %in% indian_subcont,
"Indian Subcontinent", "Others"))The same thing can be done using case_when().
tb |>
mutate(indian_sub2 = case_when(country %in%
indian_subcont ~
"Indian Subcontinent",
TRUE ~ "Other")) Lets do it again, but this time let us make it 1 and 0, 1 if it is a country in the Indian Subcontinent, 0 if otherwise.
tb |>
mutate(indian_sub3 = case_when(country %in% indian_subcont ~ 1,
TRUE ~ 0))Note
The if_else() function may result in slightly shorter code if you only need to code for 2 options. For more options, nested if_else() statements become hard to read and could result in mismatched parentheses so case_when() will be a more elegant solution.
As a second example of case_when(), let’s say we wanted to create a new population variable that is low, medium, or high.
See the pop broken into 3 equally sized portions
quantile(tb$pop, prob = c(.33, .66))Note
See the help file for quanile function or type ?quantile in the console.
We’ll say:
- low pop = 2043237 or less
- med pop = between 2043237 and 11379155
- high pop = above 11379155
tb |>
mutate(popcat = case_when(pop <= 2043237 ~ "low",
pop > 2043237 & pop <= 11379155 ~ "med",
TRUE ~ "high"))15.2.5 join()
Typically in a data science or data analysis project one would have to work with many sources of data. The researcher must be able to combine multiple datasets to answer the questions he or she is interested in. Collectively, these multiple tables of data area called relational data because more than the individual datasets, its the relations that are more important.
As with the other dplyr verbs, there are different families of verbs that are designed to work with relational data and one of the most commonly used family of verbs are the mutating joins.
These include:
left_join(x, y)which combines all columns in data framexwith those in data frameybut only retains rows fromx.right_join(x, y)also keeps all columns but operates in the opposite direction, returning only rows fromy.full_join(x, y)combines all columns ofxwith all columns ofyand retains all rows from both data frames.inner_join(x, y)combines all columns present in eitherxorybut only retains rows that are present in both data frames.anti_join(x, y)returns the columns fromxonly and retains rows ofxthat are not present iny.anti_join(y, x)returns the columns fromyonly and retains rows ofythat are not present inx.
Apart from specifying the data frames to be joined, we also need to specify the key column(s) that is to be used for joining the data. Key columns are specified with the by argument, e.g. inner_join(x, y, by = "subject_id") adds columns of y to x for all rows where the values of the “subject_id” column (present in each data frame) match. If the name of the key column is different in both the dataframes, e.g. “subject_id” in x and “subj_id” in y, then you have to specify both names using by = c("subject_id" = "subj_id").
15.2.6 pivot()
Most often, when working with our data we may have to reshape our data from long format to wide format and back. We can use the pivot family of functions to achieve this task. What we mean by “the shape of our data” is how the values are distributed across rows or columns. Here’s a visual representation of the same data in two different shapes:
“Long” format is where we have a column for each of the types of things we measured or recorded in our data. In other words, each variable has its own column.
“Wide” format occurs when we have data relating to the same measured thing in different columns. In this case, we have values related to our “metric” spread across multiple columns (a column each for a year).
Let us now use the pivot functions to reshape the data in practice. The two pivot functions are:
pivot_wider(): from long to wide format.pivot_longer(): from wide to long format.
Resources for learning more dplyr
Check out the Data Wrangling cheatsheet that covers dplyr and tidyr functions.(https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
Review the Tibbles chapter of the excellent, free R for Data Science book.(https://r4ds.had.co.nz/tibbles.html)
Check out the Transformations chapter to learn more about the dplyr package. Note that this chapter also uses the graphing package ggplot2 which we have covered yesterday.(https://r4ds.had.co.nz/transform.html)
Check out the Relational Data chapter to learn more about the joins.(https://r4ds.had.co.nz/relational-data.html)
Exercise
In 2007, which 10 countries had the highest incidence_100k?
Within the South East Asia who_region, which countries have incidence_100K > 300?
How many countries are in each who_region? Put the output in order from lowest to highest number of countries. Hint: use distinct() and arrange()
Which country in which year has the highest incidence of tuberculosis?
Excluding missing values on hiv_incidence_100k, what is the correlation coefficient for the relationship between the tb incidence_100k and the hiv_incidence_100k for each region. Show the output with the highest correlations first. Hint: use cor()
15.3 skimr Package
skimr is designed to provide summary statistics about variables in data frames, tibbles, data tables and vectors. The core function of skimr is the skim() function, which is designed to work with (grouped) data frames, and will try coerce other objects to data frames if possible.
Give skim() a try.
tb |> skimr::skim()Check out the names of the output of skimr
tb |> skimr::skim() |> names()Also works with dplyr verbs
tb |> group_by(who_region) |> skimr::skim()tb |> skimr::skim() |>
dplyr::select(skim_type, skim_variable, n_missing)15.4 DataExplorer Package
The DataExplorer package aims to automate most of data handling and visualization, so that users could focus on studying the data and extracting insights.1
The single most important function from the DataExplorer package is create_report()
Try it for yourself.
library(DataExplorer)
create_report(tb)Exercise
Create a report on Exploratory Data Analysis using RMarkdown
Use DataExplorer to explore the given data